05. Hill Climbing Pseudocode

## What's the difference between G and J?

You might be wondering: what's the difference between the return that the agent collects in a single episode (G, from the pseudocode above) and the expected return J?

Well … in reinforcement learning, the goal of the agent is to find the values of the policy network weights \theta that maximize the expected return, which we have denoted by J.
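
As a point of reference, one common way to write this objective is sketched below; the finite horizon H and the undiscounted sum are assumptions chosen for concreteness, not something fixed by the pseudocode:

```latex
% Expected return, written as an average over trajectories \tau sampled from
% the policy \pi_\theta (finite horizon H, no discounting -- an assumption):
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\, \sum_{t=0}^{H} R_{t+1} \,\right]
```

A single episode then gives one sample G = \sum_{t=0}^{H} R_{t+1} of the quantity inside the expectation.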

In the hill climbing algorithm, the values of \theta are evaluated according to how much return G they collected in a single episode. To see why this might seem a little strange, note that due to randomness in the environment (and in the policy, if it is stochastic), collecting a second episode with the same values of \theta will very likely yield a different value for the return G. Because of this, the (sampled) return G is not a perfect estimate of the expected return J, but it often turns out to be good enough in practice.
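
To make that evaluation step concrete, here is a minimal hill-climbing sketch in Python. The toy environment, the linear policy, and hyperparameters such as `noise_scale` are illustrative assumptions standing in for whatever environment the pseudocode is applied to; this is not the course's exact code.

```python
# Minimal hill-climbing sketch: theta is perturbed with Gaussian noise, and
# each candidate is scored by the return G of a *single* episode.
# The toy environment, reward, and hyperparameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def run_episode(theta, horizon=50):
    """Collect one episode and return its (sampled) return G.

    The 'environment' here is a stand-in: the agent observes a random state,
    acts via a linear policy, and receives a noisy reward that is larger when
    the action tracks the state. Real code would roll out an actual environment.
    """
    G = 0.0
    state = rng.normal(size=theta.shape)
    for _ in range(horizon):
        action = float(np.dot(theta, state))                            # deterministic linear policy
        reward = -(action - state.sum()) ** 2 + rng.normal(scale=0.1)   # noisy reward signal
        G += reward
        state = rng.normal(size=theta.shape)                            # environment transitions randomly
    return G

theta = rng.normal(size=4)      # current best policy weights
best_G = run_episode(theta)     # single-episode estimate of J(theta)
noise_scale = 1e-2

for _ in range(200):
    candidate = theta + noise_scale * rng.normal(size=theta.shape)
    G = run_episode(candidate)  # again: one noisy sample of the return, not the true J
    if G > best_G:              # greedy hill-climbing step: keep the better-scoring weights
        theta, best_G = candidate, G

print(f"best single-episode return found: {best_G:.2f}")
```

Note that calling `run_episode(theta)` twice with the same `theta` will generally return two different values of G, which is exactly the sampling noise described above; hill climbing simply accepts this noise and keeps whichever candidate happened to score higher.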